64 research outputs found

    Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

    Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice of transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but it has recently been shown that this is often not the best choice. The success of cross-lingual transfer seems to depend on some properties of languages, which are currently hard to explain. Successful transfer often happens between unrelated languages, and it often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages, and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models over the languages that score low on our measure results in the lowest average perplexity over target low-resource languages. Our correlation coefficients obtained with three different pre-trained multilingual models are consistently higher than those of all the other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choices (genealogical and typological proximity).

    The Scope and the Sources of Variation in Verbal Predicates in English and French

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 199-210. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Digitising Swiss German : how to process and study a polycentric spoken language

    Swiss dialects of German are, unlike many dialects of other standardised languages, widely used in everyday communication. Despite this, automatic processing of Swiss German is still a considerable challenge because it is mostly a spoken variety and is subject to considerable regional variation. This paper presents the ArchiMob corpus, a freely available general-purpose corpus of spoken Swiss German based on oral history interviews. The corpus is the result of a long design process, intensive manual work and specially adapted computational processing. We first present how the corpus can be accessed for linguistic, historical and computational research. We then describe how the documents were transcribed, segmented and aligned with the sound source. This work involved a series of experiments that have led to automatically annotated normalisation and part-of-speech tagging layers. Finally, we present several case studies to motivate the use of the corpus for digital humanities in general and for dialectology in particular. Peer reviewed.

    ArchiMob : A multidialectal corpus of Swiss German spontaneous speech

    Alemannische Dialektologie – Forschungsstand und Perspektiven. Sonderheft. Peer reviewed.

    Automatic interlinear glossing as two-level sequence classification

    We discuss the aspect of synchronisation in the language design and implementation of the asynchronous data flow language S-Net. Synchronisation is a crucial aspect of any coordination approach. S-Net provides a particularly simple construct, the synchrocell. As a primitive S-Net language construct, the synchrocell implements a one-off synchronisation of two data items of different types on a stream of such data items. We believe this semantics captures the essence of synchronisation, and no simpler design is possible. While the exact built-in behaviour as such is typically not what is required by S-Net application programmers, we show that in conjunction with other language features S-Net synchrocells meet typical demands for synchronisation in streaming networks quite well. Moreover, we argue that their simplistic design is, in fact, a necessary prerequisite to implement an even more interesting scenario: modelling state in streaming networks of stateless components. We finish with the outline of an efficient implementation by the S-Net runtime system.

    Linguistic Accommodation on Twitter: The Case of Serbia (Jezična akomodacija na Twitteru: Primjer Srbije)

    In this paper we investigate the phenomenon of linguistic accommodation among Serbian Twitter users by analysing geocoded messages posted between 2013 and 2016 in Bosnia and Herzegovina, Montenegro, Croatia and Serbia. We describe the language production of Twitter users by means of 16 variables that are known to vary among speakers of the pluricentric macrolanguage BCHS (Bosnian, Montenegrin, Croatian, Serbian). We compare the language production of mobile Serbian Twitter users with that of non-mobile Serbian users, as well as the production of mobile users inside and outside Serbia. While the first analysis partially supports accommodation theory, the second analysis gives no indication of this phenomenon.

    Part-of-Speech Tag Disambiguation by Cross-Linguistic Majority Vote

    Composition of lipid extract of wheat, corn and sunflower harvest residues

    As the world's population keeps growing, so does the global demand for food, which leads to an increase in the arable area planted with cereals and oilseeds. The quantity of harvest residues, which are most often burned, is growing as well. Burning harvest residues poses a major environmental risk because it is a frequent cause of fires; at the same time, the residues are a valuable biomass that remains unused. In the last few years, a trend of burning residues in the field has been observed, which leads to air pollution and endangers public health. Harvest residues contain various components that could find application in the food and pharmaceutical industries. Analysis of the composition of the lipid extract of harvest residues established the presence of biologically valuable components that could be used in the production of meat products with improved oxidative stability, better shelf life and an improved fatty acid composition, as well as in new formulations of natural cosmetics. Paper in a distinguished national journal (M52).